Abstract. Using an auxiliary memory smaller than the size of this abstract,
the LogLog algorithm makes it possible to estimate in a single pass and
within a few percent the number of different words in the whole of
Shakespeare's works. In general the LogLog algorithm makes use of m
"small bytes" of auxiliary memory in order to estimate in a single pass
the number of distinct elements (the "cardinality") in a file, and it does
so with an accuracy that is of the order of $1/\sqrt{m}$. The "small bytes" to be
used in order to count cardinalities till $N_{\max}$ comprise about
$\log\log N_{\max}$ bits, so that cardinalities well in the range of billions can be
determined using one or two kilobytes of memory only. The basic version
of the LogLog algorithm is validated by a complete analysis. An optimized
version, super-LogLog, is also engineered and tested on real-life
data. The algorithm parallelizes optimally.

1 Introduction 

The problem addressed in this note is that of determining the number of distinct 
elements, also called the cardinality, of a large file. This problem arises in several 
areas of data-mining, database query optimization, and the analysis of traffic in 
routers. In such contexts, the data may be either too large to fit at once in core 
memory or even too massive to be stored, being a huge continuous flow of data 
packets. For instance, Estan et al. [3] report traces of packet headers, produced
at a rate of 0.5GB per hour of compressed data (!), which were collected while 
trying to trace a "worm" (Code Red, August 1 to 12, 2001), and on which it 
was necessary to count the number of distinct sources passing through the link. 
We propose here the LogLog algorithm that estimates cardinalities using only 
a very small amount of auxiliary memory, namely m memory units, where a 
memory unit, a "small byte", comprises close to $\log\log N_{\max}$ bits, with $N_{\max}$
an a priori upper bound on cardinalities. The estimate is (in the sense of mean
values) asymptotically unbiased; the relative accuracy of the estimate (measured
by a standard deviation) is close to $1.05/\sqrt{m}$ for our best version of the
algorithm, super-LogLog. For instance, estimating cardinalities till $N_{\max} = 2^{27}$ (a
hundred million different records) can be achieved with m = 2048 memory units 
of 5 bits each, which corresponds to 1.28 kilobytes of auxiliary storage in total, 
the error observed being typically less than 2.5%. Since the algorithm operates 
incrementally and in a single pass it can be applied to data flows for which it 
provides on-line estimates available at any given time. Advantage can be taken 
of the low memory consumption in order to gather simultaneously a very large
number of statistics on huge heterogeneous data sets. The LogLog algorithm 
can also be fully distributed or parallelized, with optimum speed-up and mini- 
mal interprocess communication. Finally, an embedded hardware design would 
involve strictly minimal resources. 

Motivations. A traditional application of cardinality estimates is database 
query optimization. There, a complex query typically involves a variety of set- 
theoretic operations as well as projections, joins, and so on. In this context,
knowing "for free" cardinalities of associated sets provides a valuable guide for 
selecting an efficient processing strategy best suited to the data at hand. Even a 
problem as simple as merging two large files with duplicates can be treated by 
various combinations of sorting, straight merging, and filtering out duplicates 
(in one or both of the files); the cost function of each possible strategy is then 
determined by the number of records as well as by the cardinality of each file. 
Probabilistic estimation algorithms also find a use in large data recording and 
warehousing environments. There, the goal is to provide an approximate response 
in time that is orders-of-magnitude less than what computing an exact answer 
would require: see the description of the Aqua Project by Gibbons et al. in [8].

The analysis of traffic in routers, as already mentioned, benefits greatly from
cardinality estimators; this is lucidly exposed by Estan et al. in [2,3]. Certain
types of attacks ("denial of service" and "port scans") are betrayed by alarmingly 
high counts of certain characteristic events in routers. In such situations, there is 
usually not enough resource available to store and search on-line the very large 
number of events that take place even in a relatively small time window. 

Probabilistic counting algorithms can also be used within other algorithms 
whenever the final answer is the cardinality of a large set and a small tolerance 
on the quality of the answer is acceptable. Palmer et al. [13] describe the use of
such algorithms in an extensive connectivity analysis of the internet topology. 
For instance, one of the tasks needed there is to determine, for each distance h, 
the number of pairs of nodes that are at distance at most h in the internet graph. 
Since the graph studied by [13] has close to 300,000 nodes, the number of pairs 
to be considered is well over $10^{10}$, upon which costly list operations must be
performed by exact algorithms. In contrast an algorithm that would be, in the 
abstract, suboptimal can be coupled with adapted probabilistic counting
techniques and still provide reliable estimates. In this way, the authors of [13] were
able to extract extensive metric information on the internet graph by keeping a 
reduced collection of data that reside in core memory. They report a reduction 
in run-time by a factor of more than 400. 

Algorithms. The LogLog algorithm is probabilistic. Like in many similar 
algorithms, the first idea is to appeal to a hashing function in order to randomize 
data and bring them to a form that resembles random (uniform, independent) 
binary data. It is this hashed data set that is distilled into cardinality estimates 
by the algorithm. Various algorithms perform various tests on the hashed data 
set, then compare "observables" to what probabilistic analysis predicts, and 
finally "deduce" a plausible value of the parameter of interest. In the case of
LogLog counting, the observable should only be linked to cardinality, and hence
be totally independent of the nature of replications and the ordering of data
present in the file, on which no information at all is available. (Depending on
context, collisions due to hashing can either be neglected or their effect can be
estimated and corrected.)

The LogLog algorithm with m = 256 condenses the whole of Shakespeare's works
to a table of 256 "small bytes" of 4 bits each. The estimate of the number of distinct
words is here n° = 30897 (true answer: n = 28239), i.e., a relative error of +9.4%.

Whang, Zanden, and Taylor [16] have developed Linear Counting, which
distributes (hashed) values into buckets and only keeps a bitmap indicating which
buckets are hit. Then observing the number of hits in the table leads to an
estimate of cardinality. Since the number of buckets should not be much smaller
than the cardinalities to be estimated (say, $\geq N_{\max}/10$), the algorithm has space
complexity that is $O(N_{\max})$ (typically, $N_{\max}/10$ bits of storage). The linear space
is a drawback whenever large cardinalities, multiple counts, or limited hardware 
are the rule. Estan, Varghese, and Fisk [3] have devised a multiscale version of 
this principle, where a hierarchical collection of small windows on the bitmap 
is kept. From simulation data, their Multiresolution Bitmap algorithm appears 
to be about 20% more accurate than Probabilistic Counting (discussed below) 
when the same amount of memory is used. The best algorithm of [2] for flows
in routers, Adaptive Bitmap, is reported to be about 3 times more efficient than
either Probabilistic Counting or Multiresolution Bitmap, but it has the
disadvantage of not being universal, as it makes definite statistical assumptions
("stationarity") regarding the data input to the algorithm. (We recommend the
thorough engineering discussion of [3].) 

Closer to us is the Probabilistic Counting algorithm of Flajolet and
Martin [7]. This uses a certain observable that has excellent statistical properties
but is relatively costly to maintain in terms of storage. Indeed, Probabilistic
Counting estimates cardinalities with an error close to $0.78/\sqrt{m}$ given a table
of m "words", each of size about $\log_2 N_{\max}$.

Yet another possible idea is sampling. One may use any filter on hashed 
values with selectivity $p \ll 1$, store exactly and without duplicates the data
items filtered and return as estimate 1/p times the corresponding cardinality.
Wegner's Adaptive Sampling (described and analyzed in [5]) is an elegant way
to maintain dynamically varying values of p. For m "words" of memory (where
here "word" refers to the space needed by a data item), the accuracy is about
$1.20/\sqrt{m}$, which is about 50% less efficient than Probabilistic Counting.

An insightful complexity-theoretic discussion of approximate counting is
provided by Alon, Matias, and Szegedy in [1]. The authors discuss a class of
"frequency-moments" statistics which includes ours (as their $F_0$ statistics). Our
LogLog Algorithm has principles that evoke some of those found in the
intersection of [1] and the earlier [7], but contrary to [1], we develop here a complete,
eminently practical algorithmic solution and provide a very precise analysis,
including bias correction, error and risk evaluation, as well as complete
dimensioning rules.

We estimate that our LogLog algorithm outperforms the earlier Probabilistic
Counting algorithm and the similarly performing Multiresolution Bitmap
of [3] by a factor of 3 at least as it replaces "words" (of 16 to 32 bits) by "small
bytes" of typically 5 bits each, while being based on an observable that has
only slightly higher dispersion than the other two algorithms; this is expressed
by our two formulae $1.30/\sqrt{m}$ (LogLog) and $1.05/\sqrt{m}$ (super-LogLog). This
places our algorithm in the same category as Adaptive Bitmap of [3]. However,
compared to Adaptive Bitmap, the LogLog algorithm has the great advantage
of being universal as it makes no assumptions on the statistical regularity of
data. We thus believe LogLog and its improved version super-LogLog to be
the best general-purpose algorithmic solution currently known to the problem
of estimating large cardinalities.

Note. The following related references were kindly suggested by a referee: Cormode et 
al., in VLDB-2002 (a new counting method based on stable laws) and Bar-Yossef et 
al., SODA-2002 (a new application to counting triangles in graphs). 



2 The Basic LogLog Algorithm 

In computing practice, one deals with a multiset of data items, each belonging to 
a discrete universe U. For instance, in the case of natural text, U may be the set 
of all alphabetic strings of length $\leq 28$ ('antidisestablishmentarianism'), double
floats represented on 64 bits, and so on. A multiset $\mathfrak{M}$ of elements of U is given
and the problem is to estimate its cardinality, that is, the number of distinct 
elements it comprises. Here is the principle of the basic LogLog algorithm. 

Algorithm LogLog($\mathfrak{M}$: multiset of hashed values; $m = 2^k$)
    Initialize $M^{(1)}, \ldots, M^{(m)}$ to 0;
    let $\rho(y)$ be the rank of the first 1-bit from the left in y;
    for $x = b_1 b_2 \cdots \in \mathfrak{M}$ do
        set $j := \langle b_1 \cdots b_k \rangle_2$ (value of first k bits in base 2)
        set $M^{(j)} := \max(M^{(j)}, \rho(b_{k+1} b_{k+2} \cdots))$;
    return $E := \alpha_m\, m\, 2^{\frac{1}{m}\sum_j M^{(j)}}$ as cardinality estimate.
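The steps of the algorithm above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the 64-bit hash built from BLAKE2b, the names `hash64`, `alpha`, `loglog_registers` and `loglog_estimate`, and the 0-based register indexing are all choices made here, not part of the original description.

```python
import hashlib
import math

HASH_BITS = 64  # assumed hash length; any sufficiently long hash would do


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def alpha(m: int) -> float:
    """Bias correction (Gamma(-1/m) * (1 - 2**(1/m)) / log 2) ** (-m), for m >= 2."""
    return (math.gamma(-1.0 / m) * (1.0 - 2.0 ** (1.0 / m)) / math.log(2.0)) ** (-m)


def loglog_registers(items, k: int) -> list:
    """Fill m = 2**k registers with the largest rho-value seen in each bucket."""
    m = 1 << k
    rest_bits = HASH_BITS - k
    registers = [0] * m
    for item in items:
        h = hash64(item)
        j = h >> rest_bits                    # first k bits: bucket index (0-based)
        w = h & ((1 << rest_bits) - 1)        # remaining bits b_{k+1} b_{k+2} ...
        rho = rest_bits - w.bit_length() + 1  # rank of the first 1-bit from the left
        if rho > registers[j]:
            registers[j] = rho
    return registers


def loglog_estimate(registers) -> float:
    """Return alpha_m * m * 2 ** (arithmetic mean of the registers)."""
    m = len(registers)
    return alpha(m) * m * 2.0 ** (sum(registers) / m)
```

Because each register only keeps a maximum, feeding the same items again, in any order, leaves the table unchanged, which is the "observable" property the text asks for.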



We assume throughout that a hash function, h, is available that transforms
elements of U into sufficiently long binary strings, in such a way that bits
composing the hashed value closely resemble random uniform independent bits. This
pragmatic attitude¹ is justified by Knuth who writes in [10]: "It is theoretically
impossible to define a hash function that creates random data from non-random
data in actual files. But in practice it is not difficult to produce a pretty good
imitation of random data." Given this, we formalize our basic problem as follows.

1 The more theoretically inclined reader may prefer to draw h at random from a family
of universal hash functions; see, e.g., the general discussion in [12] and the specific [1].

Take $U = \{0,1\}^\infty$ as the universe of data endowed with the uniform (product)
probability distribution. An ideal multiset $\mathfrak{M}$ of cardinality n is a
random object that is produced by first drawing an n-sequence independently
at random from U, then replicating elements in an arbitrary way, and finally,
applying an arbitrary permutation.

The user is provided with the (extremely large) ideal multiset $\mathfrak{M}$ and the goal
is to estimate the (unknown to the user) value of n at a small computational cost.
No information is available, hence no statistical assumption can be made,
regarding the behaviour of the replicator-shuffler daemon.

(The fact that we consider infinite data is a convenient abstraction at this stage; 
we discuss its effect, together with needed adjustments, in Section 5 below.)

The basic idea consists in scanning $\mathfrak{M}$ and observing the patterns of the form
$0^{*}1$ that occur at the beginning of (hashed) records. For a string $x \in \{0,1\}^\infty$,
let $\rho(x)$ denote the position of its first 1-bit. Thus $\rho(1\cdots) = 1$, $\rho(001\cdots) = 3$,
etc. Clearly, we expect about $n/2^k$ amongst the distinct elements of $\mathfrak{M}$ to have
a $\rho$-value equal to k. In other words, the quantity,

    $$R(\mathfrak{M}) := \max_{x \in \mathfrak{M}} \rho(x),$$

can reasonably be hoped to provide a rough indication on the value of $\log_2 n$. It
is an "observable" in the sense above since it is totally independent of the order
and the replication structure of the multiset $\mathfrak{M}$. In fact, in probabilistic terms,
the quantity R is precisely distributed in the same way as 1 plus the maximum
of n independent geometric variables of parameter $\frac{1}{2}$. This is an extensively
researched subject; see, e.g., [14]. It turns out that R estimates $\log_2 n$ with an
additive bias of 1.33 and a standard deviation of 1.87. Thus, in a sense, the
observed value of R estimates "logarithmically" n within ±1.87 binary orders
of magnitude. Notice however that the expectation of $2^R$ is infinite so that $2^R$
cannot in fact be used to estimate n.
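The behaviour of the observable R can be explored with a small simulation. The sketch below assumes that hashed values are well modelled by 64 random bits drawn with Python's `random` module; the function names and the seed are illustrative choices.

```python
import math
import random


def rho(w: int, bits: int = 64) -> int:
    """Position of the first 1-bit from the left in a `bits`-bit word."""
    return bits - w.bit_length() + 1


def simulate_R(n: int, rng: random.Random) -> int:
    """R for an ideal multiset of cardinality n: the maximum rho over n random words."""
    return max(rho(rng.getrandbits(64)) for _ in range(n))


def mean_R(n: int, trials: int, seed: int = 1) -> float:
    """Empirical mean of R over independent trials."""
    rng = random.Random(seed)
    return sum(simulate_R(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 1000
    print(mean_R(n, 400), math.log2(n) + 1.33)
```

With a few hundred trials the empirical mean should sit near $\log_2 n + 1.33$, while individual values of R scatter by roughly two units, which is why a single R is too coarse an estimator.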

The next idea consists in separating elements into m groups also called
"buckets", where m is a design parameter. With $m = 2^k$, this is easily done by using
the first k bits of x as representing in binary the index of a bucket. One can
then compute the parameter R on each bucket, after discarding the first k bits.
If $M^{(j)}$ is the (random) value of parameter R on bucket number j, then the
arithmetic mean $\frac{1}{m}\sum_{j=1}^{m} M^{(j)}$ can legitimately be expected to approximate
$\log_2(n/m)$ plus an additive bias. The estimate of n returned by the LogLog
algorithm is accordingly

    $$E := \alpha_m\, m\, 2^{\frac{1}{m}\sum_j M^{(j)}}. \qquad (1)$$

The constant $\alpha_m$ comes out of our later analysis as
$\alpha_m := \left(\Gamma(-1/m)\,\frac{1-2^{1/m}}{\log 2}\right)^{-m}$, where $\Gamma(s) := \frac{1}{s}\int_0^\infty e^{-t} t^s\,dt$. It precisely corrects
the systematic bias of the raw arithmetic mean in the asymptotic limit. One
may also hope for a greater concentration of the estimates, hence better
accuracy, to result from averaging over $m \gg 1$ values. The main characteristics
of the algorithm are summarized below in Theorem 1. The letters $\mathbb{E}$, $\mathbb{V}$ denote
expectation and variance, and the subscript n indicates the cardinality of the
underlying random multiset.

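Theorem 1. Consider the basic LogLog algorithm applied to an ideal multiset
of (unknown) cardinality n and let E be the estimated value of cardinality
returned by the algorithm.

(i) The estimate E is asymptotically unbiased in the sense that, as $n \to \infty$,

    $$\frac{1}{n}\,\mathbb{E}_n(E) = 1 + \theta_{1,n} + o(1), \quad \text{where } |\theta_{1,n}| < 10^{-6}.$$

(ii) The standard error defined as $\frac{1}{n}\sqrt{\mathbb{V}_n(E)}$ satisfies as $n \to \infty$,

    $$\frac{1}{n}\sqrt{\mathbb{V}_n(E)} = \frac{\beta_m}{\sqrt{m}} + \theta_{2,n} + o(1), \quad \text{where } |\theta_{2,n}| < 10^{-6}.$$

One has: $\beta_{128} = 1.30540$, $\beta_\infty = \sqrt{\frac{1}{12}\log^2 2 + \frac{1}{6}\pi^2} = 1.29806$.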

In summary, apart from completely negligible fluctuations whose amplitude is
less than $10^{-6}$, the algorithm provides asymptotically a valid estimator of n. The
standard error, which measures in a mean-quadratic sense and in proportion to n
the deviations to be expected, is closely approximated by the formula²

    $$\text{Standard error} \approx \frac{\beta_m}{\sqrt{m}} \approx \frac{1.30}{\sqrt{m}}.$$

For instance, m = 256 and m = 1024 give a standard error of 8% and 4%
respectively. (These figures are compatible with what was obtained on the
Shakespeare data.) Observe also that $\alpha_m \sim \alpha_\infty - (2\pi^2 + \log^2 2)/(48m)$, where
$\alpha_\infty = e^{-\gamma}\sqrt{2}/2 = 0.39701$ ($\gamma$ is Euler's constant), so that, in practical
implementations, $\alpha_m$ can be replaced by $\alpha_\infty$ without much detectable bias as soon as
$m \geq 64$.
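The constants quoted here are easy to evaluate numerically. The following sketch computes $\alpha_m$ from its closed form, together with $\alpha_\infty$, $\beta_\infty$ and the approximate standard error; the function names are illustrative, and Python's `math.gamma` is assumed to stand in for $\Gamma$ at the negative non-integer argument $-1/m$.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def alpha(m: int) -> float:
    """alpha_m = (Gamma(-1/m) * (1 - 2**(1/m)) / log 2) ** (-m), for m >= 2."""
    return (math.gamma(-1.0 / m) * (1.0 - 2.0 ** (1.0 / m)) / math.log(2.0)) ** (-m)


# Limiting constants stated in the text.
ALPHA_INF = math.exp(-EULER_GAMMA) * math.sqrt(2.0) / 2.0
BETA_INF = math.sqrt(math.log(2.0) ** 2 / 12.0 + math.pi ** 2 / 6.0)


def standard_error(m: int) -> float:
    """Approximate standard error beta_inf / sqrt(m) of the basic algorithm."""
    return BETA_INF / math.sqrt(m)


if __name__ == "__main__":
    for m in (64, 256, 1024):
        print(m, alpha(m), ALPHA_INF - alpha(m), standard_error(m))
```

Printing the difference between $\alpha_\infty$ and $\alpha_m$ for a few values of m gives a direct feel for the bias incurred when $\alpha_m$ is replaced by its limit.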

The proof of Theorem 1 will occupy the whole of the next section.



3 The Basic Analysis 

Throughout this note, the unknown number of distinct values in the data set is 
denoted by n. The LogLog algorithm provides an estimator, E, of n. We first 
provide formulae for the expectation and variance of E. Asymptotic analysis
is performed next: The Poissonization paragraph introduces the Poisson model
where n is allowed to vary according to a Poisson law, while the
Depoissonization paragraph shows the Poisson model to be asymptotically equivalent to the
"fixed-n" model that we need. The expected value of the estimator is found to
be asymptotically n, up to minute fluctuations. This establishes the
asymptotically unbiased character of the algorithm as asserted in (i) of Theorem 1. The
standard deviation of the estimator is also proved to be of the order of n with
the proportionality coefficient providing the value of the standard error, hence
the accuracy of the algorithm, as asserted in (ii) of Theorem 1.

2 We use '$\sim$' to denote asymptotic expansions in the usual mathematical sense and
reserve the informal '$\approx$' for "approximately equal".
Fig. 1. The distribution of observed register values for the Pi file, $n \approx 2 \cdot 10^7$ with
m = 1024 [left]; the distribution $\mathbb{P}_\nu(M = k)$ of a register M, for $\nu = 2 \cdot 10^4$ [right].

We start by examining what happens in a bucket that receives $\nu$ elements
(Figure 1). The random variable M is, we recall, the maximum of $\nu$ random
variables that are independent and geometrically distributed according to
$\mathbb{P}(Y \geq k) = \frac{1}{2^{k-1}}$. Consequently, the probability distribution of M is characterized
by $\mathbb{P}_\nu(M \leq k) = \left(1 - \frac{1}{2^k}\right)^\nu$, so that $\mathbb{P}_\nu(M = k) = \left(1 - \frac{1}{2^k}\right)^\nu - \left(1 - \frac{1}{2^{k-1}}\right)^\nu$.
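The register distribution just described can be tabulated directly. The sketch below evaluates $\mathbb{P}_\nu(M = k)$ for $k \geq 1$ and $\nu \geq 1$ in floating point; the function names and the cut-off `kmax` are illustrative assumptions.

```python
import math


def register_pmf(k: int, nu: int) -> float:
    """P_nu(M = k) = (1 - 2**-k)**nu - (1 - 2**-(k-1))**nu, for k >= 1 and nu >= 1."""
    return (1.0 - 2.0 ** -k) ** nu - (1.0 - 2.0 ** -(k - 1)) ** nu


def register_mean(nu: int, kmax: int = 64) -> float:
    """Mean of M for a bucket receiving nu elements (sum truncated at kmax)."""
    return sum(k * register_pmf(k, nu) for k in range(1, kmax + 1))


if __name__ == "__main__":
    nu = 20000
    for k in range(10, 22):
        print(k, round(register_pmf(k, nu), 4))
    print(register_mean(nu), math.log2(nu) + 1.33)
```

For $\nu = 2 \cdot 10^4$ the mass concentrates on a handful of values of k slightly above $\log_2 \nu$, with a tail that thins out by about a factor of two per extra unit.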

The bivariate (exponential) generating function of this family of probability
distributions as $\nu$ varies is then

    $$G(z,u) := \sum_{\nu,k} \mathbb{P}_\nu(M = k)\, u^k \frac{z^\nu}{\nu!} = \sum_k u^k \left(e^{z(1-1/2^k)} - e^{z(1-1/2^{k-1})}\right), \qquad (2)$$

as shown by a simple calculation. The starting point of the analysis is an
expression in terms of G of the mean and variance of $Z := E/\alpha_m = m\,2^{\frac{1}{m}\sum_j M^{(j)}}$, which
is the unnormalized version of the estimator E. With the expression $[z^n]f(z)$
representing the coefficient of $z^n$ in the power series f(z), we state:

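Lemma 1. The expected value and variance of the unnormalized estimator Z
are $\mathbb{E}_n(Z) = m\, n!\,[z^n]\, G\!\left(\frac{z}{m}, 2^{1/m}\right)^m$ and

    $$\mathbb{V}_n(Z) = m^2\, n!\,[z^n]\, G\!\left(\frac{z}{m}, 2^{2/m}\right)^m - \left(m\, n!\,[z^n]\, G\!\left(\frac{z}{m}, 2^{1/m}\right)^m\right)^2.$$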

Proof. The multinomial convolution relations corresponding to mth powers of
generating functions imply that $n!\,[z^n]\,G(z/m,u)^m$ is the probability generating
function of $\sum_j M^{(j)}$. (The multinomials enumerate all ways of distributing
elements amongst buckets.) The expressions for the first and second moment of Z
are obtained from there by substituting $u \mapsto 2^{1/m}$ and $u \mapsto 2^{2/m}$.
Proving Theorem 1 is reduced to estimating asymptotically these quantities.

Poissonization. We "poissonize" the problem of computing the expected value 
and the variance. In this way, calculations take advantage of powerful properties 
of the Mellin transform. The Poisson law of rate $\lambda$ is the law of a random
variable X such that $\mathbb{P}(X = \ell) = e^{-\lambda}\frac{\lambda^\ell}{\ell!}$. Given a class $\mathcal{M}_s$ of probabilistic
models indexed by integers s, poissonizing means considering the "supermodel"
where model $\mathcal{M}_s$ is chosen according to a Poisson law of rate $\lambda$. Since the
Poisson model of a large parameter $\lambda$ is predominantly a mixture of models $\mathcal{M}_s$
with s near $\lambda$ (the Poisson law is "concentrated" near its mean), one can expect
properties of the fixed-n model $\mathcal{M}_n$ to be reflected by corresponding properties
of the Poisson model taken with rate $\lambda = n$.

A useful feature is that expressions of moments and probabilities under the
Poisson model are closely related to exponential generating functions of the
fixed-n models. This owes to the fact that if $f(z) = \sum_n f_n z^n/n!$ is the
exponential generating function of expectations of a parameter, then the quantity
$e^{-\lambda} f(\lambda) = \sum_n f_n e^{-\lambda}\frac{\lambda^n}{n!}$ gives the corresponding expectation under the
Poisson model. In this way, one sees that the quantities
$\mathcal{E}_n = m\,G\!\left(\frac{n}{m}, 2^{1/m}\right)^m e^{-n}$ and
$\mathcal{V}_n = m^2\,G\!\left(\frac{n}{m}, 2^{2/m}\right)^m e^{-n} - \left(m\,G\!\left(\frac{n}{m}, 2^{1/m}\right)^m e^{-n}\right)^2$ are respectively the
mean and variance of Z when the cardinality of the underlying multiset obeys
a Poisson law of rate $\lambda = n$.

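Lemma 2. The Poisson mean and variance $\mathcal{E}_n$ and $\mathcal{V}_n$ satisfy as $n \to \infty$:

    $$\mathcal{E}_n \sim \left[\left(\Gamma(-1/m)\,\frac{1-2^{1/m}}{\log 2}\right)^{m} + \varepsilon_n\right] \cdot n,$$

    $$\mathcal{V}_n \sim \left[\left(\Gamma(-2/m)\,\frac{1-2^{2/m}}{\log 2}\right)^{m} - \left(\Gamma(-1/m)\,\frac{1-2^{1/m}}{\log 2}\right)^{2m} + \eta_n\right] \cdot n^2,$$

where $|\varepsilon_n|$ and $|\eta_n|$ are bounded by $10^{-6}$.
The proof crucially relies on the Mellin transform [6].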
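The leading term of the Poisson mean can be checked numerically. The sketch below evaluates the Poisson expectation of Z bucket by bucket and compares it with $n/\alpha_m$. Two points are assumptions made here rather than statements of the text: buckets are treated as independent Poisson streams of rate n/m (the usual consequence of poissonization), and an explicit term $e^{-n/m}$ is added for an empty bucket, whose register stays at 0.

```python
import math


def alpha(m: int) -> float:
    """alpha_m = (Gamma(-1/m) * (1 - 2**(1/m)) / log 2) ** (-m), for m >= 2."""
    return (math.gamma(-1.0 / m) * (1.0 - 2.0 ** (1.0 / m)) / math.log(2.0)) ** (-m)


def poisson_bucket_pgf(lam: float, u: float, kmax: int = 200) -> float:
    """E[u**M] for one bucket receiving a Poisson(lam) number of elements."""
    total = math.exp(-lam)  # empty bucket: M = 0 (term added in this sketch)
    for k in range(1, kmax + 1):
        x = lam / 2.0 ** k
        # P(M = k) = exp(-x) - exp(-2x), written to avoid cancellation for small x
        total += u ** k * (-math.exp(-x) * math.expm1(-x))
    return total


def poisson_mean_Z(n: float, m: int) -> float:
    """Poisson mean of Z = m * 2**(mean of registers), with m independent buckets."""
    return m * poisson_bucket_pgf(n / m, 2.0 ** (1.0 / m)) ** m


if __name__ == "__main__":
    for m, n in ((64, 64000), (256, 512000)):
        print(m, n, alpha(m) * poisson_mean_Z(n, m) / n)
```

The printed ratio $\alpha_m \mathcal{E}_n / n$ is expected to be extremely close to 1 once n/m is large, in line with the lemma.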

Depoissonization. Finally, the asymptotic forms of the first two moments 
of the LogLog estimator can be transferred back from the Poisson model to 
the fixed-n model that underlies Theorem 1. The process involved is known as
"depoissonization". Various options are discussed in Chapter 10 of Szpankowski's
book [15]. We choose the method called "analytic depoissonization" by Jacquet
and Szpankowski, whose underlying engine is the saddle point method applied to
Cauchy integrals; see [9,15]. In essence, the values of an exponential generating
function at large arguments are closely related to the asymptotic form of its 
coefficients provided the generating function decays fast enough away from the 
positive real axis in the complex plane. The complete proof is omitted. 

Lemma 3. The first two moments of the LogLog estimator are asymptotically
equivalent under the Poisson and fixed-n model: $\mathbb{E}_n(Z) \sim \mathcal{E}_n$ and $\mathbb{V}_n(Z) \sim \mathcal{V}_n$.

Lemmas 2 and 3 together prove Theorem 1. Easy numerical calculations and
straight asymptotic analysis of $\beta_m$ conclude the evaluations stated there.



4 Space Requirements 

Now that the correctness (the absence of bias as well as accuracy) of the basic
LogLog algorithm has been established, it remains to see that it performs
as promised and only consumes $O(\log\log n)$ bits of storage if counts till n are
needed.³

3 A counting algorithm exhibiting a log-log feature in a different context is Morris's 
Approximate Counting [11] analyzed in [4].
In its abstract form of Section 2, the LogLog algorithm operates with
potentially unbounded integer registers and it consumes m of these. What we call
an $\ell$-restricted algorithm is one in which each of the $M^{(j)}$ registers is made of $\ell$
bits, that is, it can store any integer between 0 and $2^\ell - 1$. We state a shallow
result only meant to phrase mathematically the log-log property of the basic
space complexity:

Theorem 2. Let $\omega(n)$ be a function that tends to infinity arbitrarily slowly and
consider the function $\ell(n) = \log_2\log_2\left(\frac{n}{m}\right) + \omega(n)$. Then, the $\ell(n)$-restricted
algorithm and the LogLog algorithm provide the same output with probability
tending to 1 as n tends to infinity.

The auxiliary tables maintained by the algorithm then comprise m "small bytes",
each of size $\ell(n)$. In other words, the total space required by the algorithm in
order to count till n is $m \log_2\log_2\left(\frac{n}{m}\right)(1 + o(1))$. The hashing function needs
to hash values from the original data universe onto exactly $2^{\ell(n)} + \log_2 m$ bits.
Observe also that, whenever no discrepancy is present at the value n itself, the
restricted algorithm automatically provides the right answer for all values $n' \leq n$.

The proof of this theorem results from tail properties of the multinomial 
distributions and of maxima of geometric random variables. 

Assume for instance that we wish to count cardinalities till $2^{27}$, that is, over
a hundred million, with an accuracy of about 4%. By Theorem 1, one should
adopt $m = 1024 = 2^{10}$. Then, each bucket is visited roughly $n/m = 2^{17}$ times.
One has $\log_2\log_2 2^{17} \approx 4.09$. Adopt $\omega = 0.91$, so that each register has a size
of $\ell = 5$ bits, i.e., a value less than 32. Applying the upper bound on the overall
failure probability shows that an $\ell$-restriction will have little incidence on the
result: the probability of a discrepancy⁴ is lower than 12%. In summary: The
basic LogLog counting algorithm makes it possible to estimate cardinalities
till $10^8$ with a standard error of 4% using 1024 registers of 5 bits each, that is,
a table of 640 bytes in total.
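The dimensioning arithmetic of this example can be reproduced in a few lines. The helper names below are illustrative, and rounding $\ell(n)$ up to a whole number of bits is an assumption of this sketch.

```python
import math


def register_bits(n: float, m: int, omega: float) -> int:
    """Register width: log2(log2(n/m)) + omega, rounded up to whole bits."""
    return math.ceil(math.log2(math.log2(n / m)) + omega)


def table_bytes(m: int, bits: int) -> float:
    """Total size in bytes of m registers of `bits` bits each."""
    return m * bits / 8.0


if __name__ == "__main__":
    n, m = 2 ** 27, 2 ** 10
    bits = register_bits(n, m, omega=0.91)
    print(bits, table_bytes(m, bits))
```

With $n = 2^{27}$, $m = 2^{10}$ and $\omega = 0.91$ this gives 5-bit registers and a 640-byte table, as in the text.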

5 Algorithmic Engineering 

In this section, we describe a concrete implementation of the LogLog algorithm 
that incorporates the probabilistic principles seen in previous sections. At the 
same time, we propose an optimization that has several beneficial effects: (i) it 
increases at no extra cost the accuracy of the results, i.e., it decreases the disper- 
sion of the estimates around the mean value; (ii) it allows for the use of smaller 
register values, thereby improving the storage utilization of the algorithm and 
nullifying the effect of length restriction discussed in Section 4.

4 In addition, a correction factor, calculated according to the principles of Section 3,
could easily be built into the algorithm, in order to compensate the small bias induced
by restriction.

Fig. 2. The evolution of the estimate (divided by the current value of n) provided by
super-LogLog on all of Shakespeare's works: (left) words; (right) pairs of consecutive
words. Here m = 256 (standard error = 6.5%).

The fundamental probability distribution is that of the value of the M-register
in a bucket that receives $\nu$ elements (where $\nu \approx n/m$). This is the
maximum of $\nu$ geometric random variables, with mean close to $\log_2 \nu$. The
tails of this distribution, though exponential, are still relatively "soft", as there
holds $\mathbb{P}_\nu(M > \log_2 \nu + k) \approx 2^{-k}$. Since the estimate returned involves an
exponential of the arithmetic mean of bucket registers, a few exceptional values may
still distort the estimate produced by the algorithm, while more tame data will
not induce this effect. Altogether, this phenomenon lies at the origin of a natural
dispersion of estimates produced by the algorithm, hence it places a limit on the
accuracy of cardinality estimates. A simple remedy to the situation consists in
using truncation:

Truncation Rule. When collecting register values in order to produce the
final estimate, retain only the $m_0 := \lfloor \theta_0 m \rfloor$ smallest values and discard the
rest. There $\theta_0$ is a real number between 0 and 1, with $\theta_0 = 0.7$ producing
near-optimal results. The mean of these registers is computed and the
estimate returned is $m_0\,\tilde{\alpha}_m\, 2^{\frac{1}{m_0}\sum^{*} M^{(j)}}$, where $\sum^{*}$ indicates the truncated sum.

The modified constant $\tilde{\alpha}_m$ ensures that the algorithm remains unbiased.

When the truncation rule is applied, accuracy does increase. An empirically
determined formula for the standard error is $1.05/\sqrt{m}$, when the Truncation Rule
with $\theta_0 = 0.7$ is employed.
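The Truncation Rule translates into a short routine. In the sketch below the modified constant $\tilde{\alpha}_m$ is not computed: its value is not given in closed form here, so it is passed in as a hypothetical parameter `alpha_tilde` that would have to be calibrated separately. The function name is likewise illustrative.

```python
import math


def truncated_estimate(registers, alpha_tilde: float, theta0: float = 0.7) -> float:
    """Estimate m0 * alpha_tilde * 2**(mean of the m0 smallest registers).

    `alpha_tilde` is a placeholder for the modified bias-correction constant;
    the register table must be large enough that m0 = floor(theta0 * m) >= 1.
    """
    m = len(registers)
    m0 = math.floor(theta0 * m)   # number of registers retained
    kept = sorted(registers)[:m0]  # the m0 smallest values
    return m0 * alpha_tilde * 2.0 ** (sum(kept) / m0)


if __name__ == "__main__":
    # Toy table: with theta0 = 0.75 the 7 smallest of 10 registers are kept.
    print(truncated_estimate([5, 1, 9, 3, 7, 2, 8, 4, 6, 10], 1.0, theta0=0.75))
```

Sorting the table is done only when an estimate is requested, so the per-item update cost of the basic algorithm is unchanged.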

Empirical studies justify the fact that register values may be ceiled at the
value $\lceil \log_2\left(\frac{n}{m}\right) \rceil + \delta$, without detectable effect for $\delta = 3$. In other words, one
may freely combine the algorithm with restriction as follows:

Restriction Rule. Use register values that are in the interval [0..B], where
$\lceil \log_2\left(\frac{N_{\max}}{m}\right) + 3 \rceil \leq B$.

For instance for the data at the end of Section 4, with $n = 2^{27}$, m = 1024,
the value B = 20 (encoded on 5 bits) is sufficient. But now, the probability that
length-restriction affects the estimate of the algorithm drops tremendously.

Fact 1. Combining the basic LogLog counting algorithm, the Truncation
Rule and the Restriction Rule yields the super-LogLog algorithm
that estimates cardinalities with a standard error of $\approx 1.05/\sqrt{m}$
when m "small bytes" are used. Here a small byte has size
$\lceil \log_2 \lceil \log_2\left(\frac{N_{\max}}{m}\right) + 3 \rceil \rceil$, that is, 5 bits for maximum cardinalities $N_{\max}$
well over $10^8$.
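The register bound of the Restriction Rule and the resulting small-byte size can be computed as follows. The function names are illustrative, and the offset 3 is exposed as a parameter `delta` only for convenience.

```python
import math


def register_bound(n_max: float, m: int, delta: int = 3) -> int:
    """Smallest admissible B: ceil(log2(n_max / m) + delta)."""
    return math.ceil(math.log2(n_max / m) + delta)


def small_byte_bits(n_max: float, m: int, delta: int = 3) -> int:
    """Small-byte size of Fact 1: ceil(log2(B)) with B as above."""
    return math.ceil(math.log2(register_bound(n_max, m, delta)))


if __name__ == "__main__":
    print(register_bound(2 ** 27, 1024), small_byte_bits(2 ** 27, 1024))
```

For $N_{\max} = 2^{27}$ and m = 1024 this reproduces B = 20 and 5-bit registers.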
Length of the hash function and collisions. The length H of the hash
function (how many bits should it produce?) is guided by previous
considerations. There must be $\log_2 m$ bits reserved for bucketing and the bound on
register values should be at least as large as the quantity B above. Accordingly
this value H must satisfy $H \geq H_0$, where $H_0 := \log_2 m + \lceil \log_2\left(\frac{N_{\max}}{m}\right) + 3 \rceil$. In
case a value too close to $H_0$ is adopted (say $0 \leq H - H_0 < 3$), then the effect
of hashing collisions must be compensated for. This is achieved by inverting the
function that gives the expected value of the number of collisions in a hash table
(see [13,16] for an analogous discussion). The estimator is then to be changed into

    $$-2^H \log\left(1 - \frac{m_0\,\tilde{\alpha}_m}{2^H}\, 2^{\frac{1}{m_0}\sum^{*} M^{(j)}}\right).$$

(No detectable degradation of performance
results from the last modification of the estimator function, and it can safely be
used in all cases.)
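Both the minimal hash length and the collision correction are simple to evaluate. In this sketch the raw estimate (the quantity $m_0\,\tilde{\alpha}_m\,2^{\ldots}$) is taken as an input, m is assumed to be a power of two, and the function names are illustrative.

```python
import math


def min_hash_bits(n_max: float, m: int, delta: int = 3) -> int:
    """H_0 = log2(m) + ceil(log2(n_max / m) + delta), for m a power of two."""
    return int(math.log2(m)) + math.ceil(math.log2(n_max / m) + delta)


def collision_corrected(raw_estimate: float, hash_bits: int) -> float:
    """Apply -2**H * log(1 - E / 2**H) to a raw estimate E < 2**H."""
    space = 2.0 ** hash_bits
    return -space * math.log1p(-raw_estimate / space)


if __name__ == "__main__":
    print(min_hash_bits(2 ** 27, 1024))
    print(collision_corrected(1.0e6, 30))
```

When the raw estimate is small compared with $2^H$ the correction is nearly the identity, which is consistent with the remark that it can be applied in all cases.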

Risk analysis. For the pure LogLog algorithm, the estimate is an empirical 
mean of random variables that are approximately identically distributed (up 
to statistical fluctuations in bucket sizes). From there, it can be proved that 
the quantity $\frac{1}{m}\sum_j M^{(j)}$ is numerically closely approximated by a Gaussian.
Consequently, the estimate returned is very roughly Gaussian: at any rate, it
has exponentially decaying tails. (In principle, a full analysis would be feasible.)
A similar property is expected for the super-LogLog algorithm since it is based
on the same principles. As a consequence, we obtain the following pragmatic 
conclusion: 

Fact 2. Let $\sigma := 1.05/\sqrt{m}$. The estimate is within $\sigma$, $2\sigma$, and $3\sigma$ of the exact
value of the cardinality n in respectively 65%, 95%, and 99% of the cases.
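Fact 2 can be turned into a rough interval for n given an estimate. The sketch below reads "within $k\sigma$" as a relative deviation, $|E - n| \leq k\sigma n$, which is an interpretation adopted here; the function names are illustrative.

```python
import math


def super_loglog_sigma(m: int) -> float:
    """Empirical standard error 1.05 / sqrt(m)."""
    return 1.05 / math.sqrt(m)


def cardinality_interval(estimate: float, m: int, k: float = 2.0):
    """Values of n compatible with |estimate - n| <= k * sigma * n."""
    s = k * super_loglog_sigma(m)
    if s >= 1.0:
        raise ValueError("k * sigma must be below 1 for a finite interval")
    return estimate / (1.0 + s), estimate / (1.0 - s)


if __name__ == "__main__":
    print(super_loglog_sigma(1024))
    print(cardinality_interval(1000.0, 1024, k=2.0))
```

Under the approximately Gaussian behaviour described above, the interval for k = 2 would be expected to contain the true cardinality in about 95% of runs.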

6 Conclusions 

That super-LogLog performs quite well in practice is confirmed by the following
data from simulations: 



(All entries below the first row are in percent.)

$k = \log_2 m$        4     5     6     7     8     9    10    11    12
$\sigma$            29.5  19.8  13.8   9.4   6.5   4.5   3.1   2.2   1.5
$1.05/\sqrt{m}$     26.3  18.6  13.1   9.3   6.5   4.6   3.3   2.3   1.6
Random                22    16    11     8     6     4     3   2.3     2
KingLear             8.2   1.6   2.1   3.9   2.9   1.2   0.3   1.7
ShAll                2.9  13.9   4.4   0.9   9.4   4.1   3.0   0.8   0.6
Pi                    67    28   9.7   8.6   2.8   5.1   1.9   1.2   0.7


Note. $\sigma$ refers to standard error as estimated from extensive simulations, to be
compared to the empirical formula $1.05/\sqrt{m}$. The next lines display the absolute
value of the relative error measured. Random refers to averages over 10,000 runs
with n = 20,000; the other data are single runs: Pi is formed of $2 \cdot 10^7$ records
that are consecutive 10-digit slices of the first 200 million decimals of $\pi$; ShAll
is the whole of Shakespeare's works. KingLear is what its name says. (Naturally,
inherent stochastic fluctuations prevent the estimates from always depending
monotonically on memory size (m) in the case of single runs on a given piece of
data.)

616 M. Durand and P. Flajolet

As we have strived to demonstrate, the LogLog algorithm in its optimized
version performs quite well. The following table (grossly) summarizes the
accuracy (measured by standard error $\sigma$) in relation to the storage used for the major
methods known. Note that different algorithms operate with different memory
units.



Algorithm           Std. Err. ($\sigma$)        Memory units                    $n = 10^8$, $\sigma = 0.02$

Adaptive Sampling   $1.20/\sqrt{m}$             Records ($\geq$ 24-bit words)   10.8 kbytes
Prob. Counting      $0.78/\sqrt{m}$             Words (24-32 bits)              6.0 kbytes
Multires. Bitmap    $\approx 4.4/\sqrt{m}$      Bits                            4.8 kbytes
LogLog              $1.30/\sqrt{m}$             "Small bytes" (5 bits)          2.1 kbytes
Super-LogLog        $1.05/\sqrt{m}$             "Small bytes" (5 bits)          1.7 kbytes


The last column is a rough indication of the storage requirement for an accuracy
of 2% and a file of cardinality $10^8$. (The formula for Multiresolution Bitmap is
a crude extrapolation based on data of [3].)

Distributing or parallelizing the algorithm is trivial: it suffices to have dif- 
ferent processors (sharing the same hash function) operate on different slices of 
the data and then "max-merge" their tables of registers. Optimal speed-up is 
clearly attained and interprocess communication is limited to just a few kilo- 
bytes. Requirements for an embedded hardware design are absolutely minimal 
as only addressing, register comparisons, and integer addition are needed. 
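The "max-merge" of register tables can be illustrated with a short sketch. The 64-bit BLAKE2b hash, the helper names and the 0-based bucket indexing are assumptions made for the example; the only requirement taken from the text is that all workers share the same hash function.

```python
import hashlib

HASH_BITS = 64  # assumed hash length shared by all workers


def fill_registers(items, k: int) -> list:
    """Register table of m = 2**k maxima of rho-values for one slice of data."""
    rest_bits = HASH_BITS - k
    registers = [0] * (1 << k)
    for item in items:
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        j = h >> rest_bits                    # bucket index from the first k bits
        w = h & ((1 << rest_bits) - 1)        # remaining bits
        rho = rest_bits - w.bit_length() + 1  # rank of the first 1-bit
        registers[j] = max(registers[j], rho)
    return registers


def max_merge(table_a, table_b) -> list:
    """Merge two register tables of equal size by taking element-wise maxima."""
    if len(table_a) != len(table_b):
        raise ValueError("tables must use the same number of registers")
    return [max(a, b) for a, b in zip(table_a, table_b)]


if __name__ == "__main__":
    data = [str(i) for i in range(5000)]
    merged = max_merge(fill_registers(data[:3000], 8), fill_registers(data[2000:], 8))
    print(merged == fill_registers(data, 8))
```

Since each register is a maximum, the merged table is identical to the one a single pass over all the data would produce, even when slices overlap.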